import openai
from tqdm import tqdm
from causal_chains.CausalChain import util # https://github.com/helliun/causal-chains
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from dotenv import load_dotenv
import os
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import google.generativeai as genai
from google.generativeai.types import HarmCategory, HarmBlockThreshold
import pathlib
import textwrap
In [1]:
In [2]:
who_data = pd.read_csv("../data/corpus.csv")
Causal text mining (CTM) has been applied to various NLP tasks such as knowledge base construction, question answering, and text summarization. CTM methodologies often involve two phases, causal sequence classification and causal span detection:
- Causal sequence classification is a binary classification task that detects whether a sequence entails causality. It requires a deep understanding of commonsense knowledge, because determining causality depends on comprehending underlying real-world principles and contexts (Gao et al.).
- The causal span detection task aims to distinguish between cause and effect arguments present in causal sequences. This task requires a precise understanding of a complex context that comprises multiple entities and events to discern which parts of sequences correspond to causes and effects and which are noise, in addition to the capabilities previously mentioned.
Biomedical causal relations extracted from different resources, such as online journals, books, and reports, can be leveraged to form causal chains, which may result in the discovery of previously unknown relations.
CTM includes various approaches:
- Knowledge-based systems (expert opinions): relied heavily on domain experts to define rules and patterns for identifying causal relationships in text.
- Machine learning: Naive Bayes, Support Vector Machines (SVM), and Conditional Random Fields (CRF) were used to classify and extract causal relationships. These models required extensive feature engineering and relied on lexical and syntactic features such as keywords ("due to", "can cause"), part-of-speech tags, and dependency relations.
- Deep learning techniques
  - Multiview Convolutional Neural Networks (MVC): This approach leverages multiple views of the input text to capture different aspects of the data. It can combine syntactic, semantic, and positional information to enhance causal relation extraction.
  - Recurrent Neural Networks (RNN), such as BiLSTM (Bidirectional Long Short-Term Memory) models: These models can capture long-range dependencies in text by processing it in both forward and backward directions. Attention mechanisms are often integrated to focus on the parts of the text that contribute to causal relationships.
  - Graph Neural Networks (GNNs): GNNs can model text as graphs, where nodes represent entities or concepts and edges represent relationships. This approach is beneficial for capturing complex causal structures.
- Transformer Models
  - Bidirectional Encoder Representations from Transformers (BERT): BERT is pre-trained on large corpora and can be fine-tuned for specific tasks. It captures context from both directions, making it effective for understanding complex dependencies in text. Variants like BioBERT (for biomedical text) and ClinicalBERT are tailored for specific domains.
  - ELMo (Embeddings from Language Models): ELMo, a BiLSTM-based predecessor of the Transformer models, generates contextualized word embeddings by considering the entire sentence, providing richer representations for identifying causal relationships.
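As a rough illustration of the lexical cues that the knowledge-based and feature-engineered approaches above rely on, the sketch below splits a phrase around a small list of causal cue phrases. The cue lists and the function name `extract_by_cue` are hypothetical and chosen only for this example; real systems use far richer patterns.

```python
import re

# Hypothetical cue phrases. With "effect-first" cues the effect precedes the cue;
# with "cause-first" cues the cause precedes the cue.
EFFECT_FIRST_CUES = ["due to", "because of", "caused by"]
CAUSE_FIRST_CUES = ["can cause", "leads to", "results in"]


def extract_by_cue(sentence):
    """Return (cause, effect) for the first cue phrase found, else None."""
    text = sentence.strip().rstrip(".")
    for cue in EFFECT_FIRST_CUES + CAUSE_FIRST_CUES:
        pattern = rf"\b{re.escape(cue)}\b"
        parts = re.split(pattern, text, maxsplit=1, flags=re.IGNORECASE)
        if len(parts) != 2:
            continue  # cue not present in this sentence
        left, right = parts[0].strip(), parts[1].strip()
        if not left or not right:
            continue  # cue at the very start or end, nothing to pair
        if cue in EFFECT_FIRST_CUES:
            return right, left
        return left, right
    return None


print(extract_by_cue("delays in care-seeking behaviour due to limited access to care"))
```

A sentence with no explicit cue returns `None`, which is exactly the implicit-causality case that such rules cannot handle.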
LLMs have demonstrated impressive performance across numerous NLP tasks with zero-shot or few-shot in-context learning, without requiring the supervised training that traditional encoder-based models need.
ChatGPT often demonstrates competitive results in few-shot settings, even on financial domain-specific and Japanese datasets, although a fully trained encoder-based model still outperforms it. Because its performance is not influenced by training data size, whereas encoder models depend heavily on it, ChatGPT is a good starting point when training data are limited or unavailable, but not a good causal text miner when training data are readily available.
ChatGPT struggles with complex causality types, especially intra-/inter-sentential and implicit causality.
Sample sentence: The sudden appearance of unlinked cases of mpox in South Africa without a history of international travel, the high HIV prevalence among confirmed cases, and the high case fatality ratio suggest that community transmission is underway, and the cases detected to date represent a small proportion of all mpox cases that might be occurring in the community; it is unknown how long the virus may have been circulating. This may in part be due to the lack of early clinical recognition of an infection with which South Africa previously gained little experience during the ongoing global outbreak, potential pauci-symptomatic manifestation of the disease, or delays in care-seeking behaviour due to limited access to care or fear of stigma.
Expected results:
- Cause: lack of early clinical recognition of an infection -> Effect: community transmission of mpox
- Cause: pauci-symptomatic manifestation of the disease -> Effect: lack of early clinical recognition of an infection
- Cause: delays in care-seeking behaviour -> Effect: lack of early clinical recognition of an infection
- Cause: limited access to care -> Effect: delays in care-seeking behaviour
- Cause: fear of stigma -> Effect: delays in care-seeking behaviour
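The expected pairs above link into a chain, since the effect of one pair is the cause of another. A minimal sketch of how such pairs could be followed from a root cause to a final outcome is shown below; `build_graph` and `chains_from` are hypothetical helper names, not functions from the causal-chains library.

```python
from collections import defaultdict

# Cause -> effect pairs copied from the expected results above.
pairs = [
    ("lack of early clinical recognition of an infection", "community transmission of mpox"),
    ("pauci-symptomatic manifestation of the disease", "lack of early clinical recognition of an infection"),
    ("delays in care-seeking behaviour", "lack of early clinical recognition of an infection"),
    ("limited access to care", "delays in care-seeking behaviour"),
    ("fear of stigma", "delays in care-seeking behaviour"),
]


def build_graph(pairs):
    """Map each cause to the list of effects it points to."""
    graph = defaultdict(list)
    for cause, effect in pairs:
        graph[cause].append(effect)
    return graph


def chains_from(graph, start, path=None):
    """Return every chain from `start` to a node with no further effects."""
    path = (path or []) + [start]
    # Skip nodes already on the path so that cycles cannot loop forever.
    next_effects = [e for e in graph.get(start, []) if e not in path]
    if not next_effects:
        return [path]
    chains = []
    for effect in next_effects:
        chains.extend(chains_from(graph, effect, path))
    return chains


graph = build_graph(pairs)
for chain in chains_from(graph, "fear of stigma"):
    print(" -> ".join(chain))
```

Starting from "fear of stigma", the traversal passes through delayed care-seeking and late clinical recognition before ending at community transmission of mpox.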
In [3]:
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
gemini_api_key = os.getenv("GEMINI_API_KEY")
# Initialize the Gemini API client
genai.configure(api_key=gemini_api_key)
safety_filters = {
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    # ... add other categories if you need them and set them to BLOCK_NONE
}
class CausalChain:
    one_shot_example = """
Example of disease transmission
Text: The sudden appearance of unlinked cases of mpox in South Africa without a history of international travel, the high HIV prevalence among confirmed cases, and the high case fatality ratio suggest that community transmission is underway, and the cases detected to date represent a small proportion of all mpox cases that might be occurring in the community; it is unknown how long the virus may have been circulating. This may in part be due to the lack of early clinical recognition of an infection with which South Africa previously gained little experience during the ongoing global outbreak, potential pauci-symptomatic manifestation of the disease, or delays in care-seeking behaviour due to limited access to care or fear of stigma.
Question: Which drivers cause the emergence or transmission of an infectious disease outbreak in the region?
Answer:
Cause: limited access to care (Public Health Systems) -> Effect: delays in care-seeking behaviour (Social & Demographic Change)
Cause: fear of stigma (Social & Demographic Change) -> Effect: delays in care-seeking behaviour (Social & Demographic Change)
Cause: delays in care-seeking behaviour (Social & Demographic Change) -> Effect: lack of early clinical recognition of an infection (Public Health Systems)
Cause: pauci-symptomatic manifestation of the disease (Disease characteristics) -> Effect: lack of early clinical recognition of an infection (Public Health Systems)
Cause: lack of early clinical recognition of an infection (Public Health Systems) -> Effect: community transmission of mpox (Disease transmission)
"""
    two_shot_example = """
Example of disease emergence
Text: The risk of dengue is similar across regions, countries, and within countries. Factors associated with an increasing risk of dengue epidemics and spread to new countries include: early start and longer duration of dengue transmission seasons in endemic areas; changing distribution and increasing abundance of the vectors (Aedes aegypti and Aedes albopictus); consequences of climate change and periodic weather phenomena (El Nino and La Nina events) leading to heavy precipitation, humidity, and rising temperatures favouring vector reproduction and virus transmission;
Question: Which drivers cause the emergence or transmission of an infectious disease outbreak in the region?
Answer:
Cause: consequences of climate change and periodic weather phenomena (Globalization & Environmental Change) -> Effect: vector reproduction and virus transmission (Disease characteristics)
Cause: vector reproduction and virus transmission (Disease characteristics) -> Effect: changing distribution and increasing abundance of the vectors (Disease characteristics)
Cause: changing distribution and increasing abundance of the vectors (Disease characteristics) -> Effect: early start and longer duration of dengue transmission seasons in endemic areas (Disease characteristics)
Cause: early start and longer duration of dengue transmission seasons in endemic areas (Disease characteristics) -> Effect: increasing risk of dengue epidemics and spread to new countries (Disease emergence)
"""
    prompt_template = """
Infectious disease (ID) events occur when an underlying mix of antecedent epidemiologic drivers provide the necessary conditions for a pathogen to emerge in susceptible populations.
These conditions may be driving emergence and transmission through a wide variety of mechanisms, mainly among 5 groups:
Social & Demographic Change: Lifestyle (sexual behaviors, eating raw meat, human-wildlife interaction), Human susceptibility to infection, Social inequality (poverty and social inequality), Demographic change (Economic development, War and famine)
Public Health Systems: Animal health & food security, Surveillance & reporting, Healthcare system structure and change, Breakdown of public health measures
Globalization & Environmental Change: Environment & climate change (Climate and weather, Land use and ecosystem changes), Travel, tourism, and migration, global trade
Governance factors: Lack of political will, Research & development
Disease characteristics: pauci-symptomatic manifestation, changing distribution and increasing abundance of the vectors
If none of these 5 groups applies, categorize as "Other groups"
The final output can be "Disease transmission" or "Disease emergence"
You analyze by answering the question: Which drivers cause the emergence or transmission of an infectious disease outbreak in the region?
Read through the paragraph and organize drivers in a logical sequence of events. Here are examples:
{one_shot_example}
{two_shot_example}
Your task now is to read through the outbreak report and identify drivers leading to the outbreak of infectious diseases. Keep the original text that you find in the paragraph.
Sometimes text may contain some unwanted behaviors like sexual behaviors, war. This is fine because I am using it for research purposes only. Show me all results that you can find.
Text: {chunk}
List the causes and their corresponding effects in the format 'Cause: [cause] -> Effect: [effect]':
"""
    def __init__(self, chunks=None):
        # Use None instead of a shared mutable default list
        self.chunks = chunks if chunks is not None else []
        self.causes = []
        self.effects = []
        self.outlines = []
        self.sources = []

    def create_effects(self, api="openai", batch_size=16):
        # Note: batch_size is accepted but not used; chunks are processed one at a time
        print("Analyzing causation...")
        for chunk in tqdm(self.chunks):
            if api == "openai":
                cause_effect_pairs = self.extract_cause_effect_openai(chunk)
            elif api == "gemini":
                cause_effect_pairs = self.extract_cause_effect_gemini(chunk)
            else:
                raise ValueError("Invalid API specified. Use 'openai' or 'gemini'.")
            for pair in cause_effect_pairs:
                cause, effect = pair
                self.causes.append(cause)
                self.effects.append(effect)
                self.outlines.append(f"Cause: {cause} -> Effect: {effect}")
                self.sources.append(api)

    def extract_cause_effect_openai(self, chunk):
        prompt = self.prompt_template.format(
            one_shot_example=self.one_shot_example,
            two_shot_example=self.two_shot_example,
            chunk=chunk
        )
        # openai.ChatCompletion is the legacy interface (openai<1.0)
        response = openai.ChatCompletion.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You are a helpful assistant specialized in identifying drivers leading to diseases.",
                },
                {"role": "user", "content": prompt},
            ],
            max_tokens=300,
            temperature=0.5,
        )
        response_text = response["choices"][0]["message"]["content"]
        return self.parse_response(response_text)

    def extract_cause_effect_gemini(self, chunk):
        prompt = self.prompt_template.format(
            one_shot_example=self.one_shot_example,
            two_shot_example=self.two_shot_example,
            chunk=chunk
        )
        response = genai.GenerativeModel('gemini-1.5-pro').generate_content(
            prompt,
            safety_settings=safety_filters
        )
        response_text = response.text
        return self.parse_response(response_text)

    @staticmethod
    def parse_response(response_text):
        # Keep only lines of the form "Cause: ... -> Effect: ..."
        cause_effect_pairs = []
        for line in response_text.split("\n"):
            if "Cause:" in line and "-> Effect:" in line:
                cause = line.split("Cause:")[1].split("-> Effect:")[0].strip()
                effect = line.split("-> Effect:")[1].strip()
                cause_effect_pairs.append((cause, effect))
        return cause_effect_pairs

def create_causes_effects_dataframe(causes, effects, sources):
    def split_cause_effect(value):
        # Treat the last parenthetical as the group label
        if "(" in value and ")" in value:
            main_text, group = value.rsplit("(", 1)
            main_text = main_text.strip()
            group = group[:-1].strip()  # Remove the closing parenthesis
            return main_text, group
        return value, "Unknown"

    cause_texts, cause_groups = zip(*[split_cause_effect(cause) for cause in causes])
    effect_texts, effect_groups = zip(*[split_cause_effect(effect) for effect in effects])

    data = {
        "Cause": cause_texts,
        "Cause_group": cause_groups,
        "Effect": effect_texts,
        "Effect_group": effect_groups,
        "Source": sources
    }
    df = pd.DataFrame(data)
    return df
Example of text to ask LLMs
In the Democratic Republic of the Congo, most reported cases in known endemic provinces continue to be among children under 15 years of age, especially in young children. Infants and children under five years of age are at highest risk of severe disease and death, particularly where prompt optimal case management is limited or unavailable. The number of cases reported weekly remains consistently high while the outbreak continues to expand geographically. High test positivity among tested cases in most provinces also suggests that undetected transmission is likely ongoing in the community. Transmission of mpox due to clade I MPXV via sexual contact in key populations was first identified in the Democratic Republic of the Congo in 2023. In South Kivu province, mpox transmission is sustained through human-to-human contact (sexual and non-sexual)
In [128]:
text = who_data["Text"][9]
chunks = util.create_chunks(text)
cc = CausalChain(chunks)
In [130]:
cc.create_effects(api="openai")
Analyzing causation...
100%|██████████████████████████████████████████| 12/12 [00:37<00:00, 3.13s/it]
In [129]:
cc.create_effects(api="gemini")
Analyzing causation...
100%|██████████████████████████████████████████| 12/12 [01:18<00:00, 6.57s/it]
In [135]:
df = create_causes_effects_dataframe(cc.causes, cc.effects, cc.sources)
In [136]:
display(df[df['Source'] == 'gemini'])
| | Cause | Cause_group | Effect | Effect_group | Source |
|---|---|---|---|---|---|
0 | Limited availability of prompt optimal case ma... | Public Health Systems | Infants and children under five years of age a... | Social & Demographic Change)* | gemini |
1 | ** human-to-human contact | sexual and non-sexual) * | ** Transmission of mpox | Disease Transmission | gemini |
2 | ** sexual contact in key populations ** | Unknown | ** Transmission of mpox due to clade I MPXV | Disease Transmission | gemini |
3 | ** undetected transmission in the community ** | Unknown | ** High test positivity among tested cases | Disease Transmission | gemini |
4 | **lack of timely access to diagnostics in many... | Public Health Systems | **incomplete epidemiological investigations** | Public Health Systems | gemini |
5 | **incomplete epidemiological investigations** | Public Health Systems | **challenges in contact tracing** | Public Health Systems | gemini |
6 | **challenges in contact tracing** | Public Health Systems | **the outbreak in South Kivu is already spread... | Disease transmission | gemini |
7 | eradication of smallpox | Public Health Systems | immunity gap | Social & Demographic Change | gemini |
8 | MPXV continues to move into the immunity gap | Disease characteristics | human-to-human transmission | Disease transmission | gemini |
9 | logistical and resource challenges | Public Health Systems | limited Surveillance and investigating alerts | Public Health Systems | gemini |
10 | limited laboratory capacities | Public Health Systems | limited Surveillance and investigating alerts | Public Health Systems | gemini |
11 | lack of effective dissemination to date of hea... | Public Health Systems | low awareness of the risks associated with mpox | Social & Demographic Change)* | gemini |
12 | low awareness of the risks associated with mpox | Social & Demographic Change | exposes them to further risk | Disease Transmission)* | gemini |
13 | ** River boat travel | Globalization & Environmental Change) * | ** Outbreaks in Kinshasa | Disease transmission | gemini |
14 | ** Co-infections with HIV and other sexually t... | Social & Demographic Change) * | ** Increased severity of MPXV | Disease Transmission | gemini |
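Several Gemini rows above carry Markdown emphasis markers (`**`) and group labels such as `Social & Demographic Change)*`, and row 1 treats the parenthetical `(sexual and non-sexual)` as a group. One possible cleanup is sketched below, under the assumption that valid group names are exactly those listed in the prompt: it strips the emphasis markers and accepts a trailing parenthetical as a group only when it matches a known label. `split_label` and `KNOWN_GROUPS` are hypothetical helpers, not part of the notebook, and the sample strings are guesses at what the raw model lines looked like.

```python
import re

# Group names taken from the prompt template; matching is case-insensitive.
KNOWN_GROUPS = {
    "social & demographic change",
    "public health systems",
    "globalization & environmental change",
    "governance factors",
    "disease characteristics",
    "disease transmission",
    "disease emergence",
    "other groups",
}


def split_label(value):
    """Split 'text (Group)' into (text, group), tolerating Markdown emphasis."""
    # Assumption: asterisks never carry meaning in the extracted text.
    cleaned = value.replace("*", "").strip()
    match = re.search(r"\(([^()]*)\)\s*$", cleaned)
    if match and match.group(1).strip().lower() in KNOWN_GROUPS:
        return cleaned[: match.start()].strip(), match.group(1).strip()
    # No trailing parenthetical, or one that is not a known group name.
    return cleaned, "Unknown"


print(split_label("** River boat travel (Globalization & Environmental Change) **"))
print(split_label("** human-to-human contact (sexual and non-sexual) **"))
```

Restricting the match to known labels keeps descriptive parentheticals inside the cause text, at the cost of marking any group the model invents as "Unknown".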
In [137]:
display(df[df['Source'] == 'openai'])
| | Cause | Cause_group | Effect | Effect_group | Source |
|---|---|---|---|---|---|
15 | limited or unavailable prompt optimal case man... | Public Health Systems | high risk of severe disease and death in infan... | Social & Demographic Change | openai |
16 | high risk of severe disease and death in infan... | Social & Demographic Change | consistently high number of cases reported weekly | Disease transmission | openai |
17 | consistently high number of cases reported weekly | Disease transmission | outbreak continues to expand geographically | Disease transmission | openai |
18 | high test positivity among tested cases in mos... | Public Health Systems | undetected transmission likely ongoing in the ... | Disease transmission | openai |
19 | sexual contact in key populations | Social & Demographic Change | transmission of mpox due to clade I MPXV | Disease transmission | openai |
20 | human-to-human contact (sexual and non-sexual)... | Social & Demographic Change | sustained mpox transmission | Disease transmission | openai |
21 | sexual contact | Social & Demographic Change | faster transmission | Disease transmission | openai |
22 | immune suppression, especially among those wit... | Social & Demographic Change | risk factors for severe disease and death amon... | Disease transmission | openai |
23 | prevalence of HIV in the general adult populat... | Social & Demographic Change | higher risk of severe disease and death among ... | Disease transmission | openai |
24 | sustained human-to-human sexual transmission o... | Social & Demographic Change | additional public health impact | Public Health Systems | openai |
25 | higher HIV prevalence in the eastern provinces | Social & Demographic Change | higher risk of severe disease and death among ... | Disease transmission | openai |
26 | lack of timely access to diagnostics in many a... | Public Health Systems | incomplete epidemiological investigations | Public Health Systems | openai |
27 | incomplete epidemiological investigations | Public Health Systems | challenges in contact tracing | Public Health Systems | openai |
28 | challenges in contact tracing | Public Health Systems | outbreak spreading into the wider community | Disease transmission | openai |
29 | outbreak spreading into the wider community | Disease transmission | occurrence of cases among a broad range of occ... | Disease transmission | openai |
30 | new features of human-to-human transmission | Disease characteristics | further rapid expansion of the outbreak | Disease transmission | openai |
31 | further rapid expansion of the outbreak | Disease transmission | geographic expansion to new areas, such as Kin... | Disease transmission | openai |
32 | geographic expansion to new areas, such as Kin... | Disease transmission | increase in suspected cases reported | Disease transmission | openai |
33 | travel from endemic areas | Globalization & Environmental Change | cases in newly affected provinces | Disease transmission | openai |
34 | secondary or sustained human-to-human transmis... | Disease transmission | cases in newly affected provinces | Disease transmission | openai |
35 | immunity gap left following eradication of sma... | Social & Demographic Change | MPXV continues to move | Disease transmission | openai |
36 | logistical and resource challenges | Public Health Systems | limited surveillance and investigating alerts | Public Health Systems | openai |
37 | limited laboratory capacities | Public Health Systems | limited surveillance and investigating alerts | Public Health Systems | openai |
38 | reliance on the support of WHO and other partners | Public Health Systems | response capacities to mpox in the country | Public Health Systems | openai |
39 | ongoing immunogenicity and safety studies of M... | Governance factors | national immunization technical advisory group... | Governance factors | openai |
40 | national immunization technical advisory group... | Governance factors | use of mpox vaccines in the country for person... | Public Health Systems | openai |
41 | recommendations for preferred use of LC16 in c... | Governance factors | immunization strategy for different age groups | Public Health Systems | openai |
42 | intention to vaccinate persons at risk | Governance factors | use of LC16 and MVA-BN vaccinia-based mpox vac... | Governance factors | openai |
43 | request for authorization of temporary use of ... | Governance factors | regulatory review by ACOREP | Governance factors | openai |
44 | regulatory review by ACOREP | Governance factors | temporary use of these vaccines | Governance factors | openai |
45 | planning of further clinical efficacy and safe... | Governance factors | further clinical efficacy and safety studies f... | Governance factors | openai |
46 | developing emergency response immunization str... | Public Health Systems | persons and areas at risk are targeted | Public Health Systems | openai |
47 | extensive consultation internally, with WHO an... | Governance factors | development of emergency response immunization... | Public Health Systems | openai |
48 | clinical efficacy studies of tecovirimat | Governance factors | potential future access to tecovirimat | Public Health Systems | openai |
49 | study expected to complete recruitment in 2024 | Governance factors | delayed access to tecovirimat until study comp... | Public Health Systems | openai |
50 | low awareness of the risks associated with mpo... | Public Health Systems | increased risk of disease transmission in the ... | Disease transmission | openai |
51 | lack of effective dissemination of health mess... | Public Health Systems | increased risk for key populations such as sex... | Social & Demographic Change | openai |
52 | increased risk for key populations such as sex... | Social & Demographic Change | further exposure to mpox | Disease transmission | openai |
53 | co-infections with HIV and other sexually tran... | Social & Demographic Change | outbreaks in newly reported areas in southern ... | Disease transmission | openai |
54 | river boat travel | Globalization & Environmental Change | outbreaks in the city of Kinshasa | Disease transmission | openai |
55 | under detection or underreporting of transmission | Public Health Systems | significant under detection or underreporting ... | Public Health Systems | openai |
56 | resources to respond over such a wide geograph... | Public Health Systems | insufficient response to the outbreak | Public Health Systems | openai |
57 | resource mobilization is slow | Public Health Systems | insufficient response to the outbreak | Public Health Systems | openai |
58 | public awareness remains limited | Social & Demographic Change | insufficient response to the outbreak | Public Health Systems | openai |
59 | resources are scarce | Public Health Systems | insufficient response to the outbreak | Public Health Systems | openai |
60 | technical and financial support is needed | Public Health Systems | insufficient response to the outbreak | Public Health Systems | openai |
61 | insufficient response to the outbreak | Public Health Systems | continuation of disease transmission | Disease transmission | openai |
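The two models return different numbers of pairs for the same text (15 rows from Gemini, 47 from OpenAI), so it can help to check which pairs they agree on. The notebook imports `SentenceTransformer` and `cosine_similarity`, which suggests an embedding-based comparison; the sketch below swaps in plain token overlap (Jaccard similarity) as a lightweight stand-in. The function names and the 0.5 threshold are assumptions made for this example, and the sample pairs are a few untruncated rows copied from the tables above.

```python
import re


def tokens(text):
    """Lower-case word tokens, ignoring punctuation and Markdown markers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of distinct tokens that two strings have in common."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_pairs(left, right, threshold=0.5):
    """Pairs in `left` whose cause and effect both overlap some pair in `right`."""
    matches = []
    for l_cause, l_effect in left:
        for r_cause, r_effect in right:
            if jaccard(l_cause, r_cause) >= threshold and jaccard(l_effect, r_effect) >= threshold:
                matches.append(((l_cause, l_effect), (r_cause, r_effect)))
                break  # keep only the first match for each left-hand pair
    return matches


gemini_pairs = [
    ("logistical and resource challenges", "limited Surveillance and investigating alerts"),
    ("limited laboratory capacities", "limited Surveillance and investigating alerts"),
    ("** River boat travel", "** Outbreaks in Kinshasa"),
    ("eradication of smallpox", "immunity gap"),
]
openai_pairs = [
    ("logistical and resource challenges", "limited surveillance and investigating alerts"),
    ("limited laboratory capacities", "limited surveillance and investigating alerts"),
    ("river boat travel", "outbreaks in the city of Kinshasa"),
    ("sexual contact", "faster transmission"),
]

matches = match_pairs(gemini_pairs, openai_pairs)
print(f"{len(matches)} of {len(gemini_pairs)} Gemini pairs also appear in the OpenAI output")
```

Token overlap misses paraphrases that share few words, which is where sentence embeddings would be the more suitable comparison.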